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I. INTRODUCTION 

Anisotropic discretisations [H, 0, have proved an extremely useful tool for determining the energy spectrum 
of lattice field theories using the Monte Carlo method. Four dimensional Euclidean space-time is discretised with 
separate grid spacings as and at for the spatial dimensions and the temporal direction with ^ = as/at- The spectrum 
of the theory is determined by measuring the correlation function between operators on the fields in a Monte Carlo 
calculation and performing a statistical analysis to determine the rate of decay of these functions. This fall-off is related 
to the energies of eigcnstates of the quantum-mechanical hamiltonian. A fine resolution in the temporal direction is 
crucial to make an accurate determination of energies, particularly when considering massive or highly excited states. 
The advantage of the anisotropic lattice is that the rapid rise in numerical effort needed to reduce the lattice spacing 
is ameliorated by just reducing the spacing in the time direction. The Hadron Spectrum collaboration is currently 
using these lattice regularisations in a large-scale programme 0j S aiming to measure the QCD spectrum to high 
precision. The calculations by the collaboration include the full dynamics of the lightest three quark fields and in 
these investigations, the dominant source of computational overhead is solving the linear system corresponding to 
quark propagation in a given gauge field background. Quark propagation is described by finding solutions to a lattice 
representation of the Euclidean-space Dirac equation: 

{p+m)^P = lJ, (1) 

where p is the Dirac operator in the presence of the gauge fields. The Hadron Spectrum collaboration has used 
anisotropic lattices with Wilson and Sheikholeslami-Wohlert discretisations of the Dirac operator. These solutions, 
■0 must be computed during application of the Markov process that generates the ensemble of gauge fields for Monte 
Carlo importance sampling as well as during the measurement phase. The vector space in which the solution is to be 
constructed is sufficiently large that direct methods are impractical and instead iterative solvers must be employed. 
While the anisotropic lattice gives substantial benefits in statistical precision the disadvantage is that the inclusion of 
a fine discretisation scale; at; increases the condition number of the linear system to be solved. This naturally leads 
to an increase in the number of iterations needed in the solver. A well-established means of improving efficiency is to 
precondition the problem. 

A quick examination reveals that the dominant terms in the coefficient matrix of the linear system represent the 
discretisation of the temporal derivative in the Dirac operator. In this work, operators that directly inverts this term 
are constructed and used to form a preconditioner. Once the means of applying the inverse of the temporal term in 
the Wilson or SW operator has been defined it is seen that further preconditioning using Schur or ILU factorisation 
can be applied. To test this idea, the condition numbers of temporally preconditioned operators are computed on 
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several gauge field ensembles. This test uses quenched ensembles since they are substantially easier to generate with 
a range of different anisotropies than fully dynamical ones, and the issue of determining the parameters in the lattice 
action needed to give a particular physical anisotropy ^ is greatly simplified. The temporal preconditioned matrices 
to be inverted have condition numbers approximately twenty times smaller than their unpreconditioned counterparts 
and about three to four times smaller than the commonly used even-odd lattice Schur preconditioning when the 
anisotropy ranges from three to six. Taking into account numerical overheads, this translates to cost reductions of 
around 1.7-2.4. No advantage is found using temporal preconditioning in the isotropic case. 

This paper is organized as follows: in section[TT]we set up our notation, and describe in detail the basics of temporal 
preconditioning and how it may be combined with further even-odd preconditioning. In section IIIII we detail our 
numerical investigations, including the tuning of our fermion anisotropy parameters and a presentation of condition 
numbers for various kinds of preconditioning schemes. We sum up and draw our conclusions in section HVl 



II. THEORY 



The Wilson fermion matrix and its C'(a)-improved Sheikholeslami-Wohlert (SW) version can be defined on an 
anisotropic lattice with three coarse spatial directions with spacing Og and a fine temporal spacing at- In practice, the 
anisotropy is introduced into the simulation by modifying the couplings in the lattice action that weight temporal and 
spatial components. There relative weights are denoted jg for terms in the gauge action and 7/ for the fermions. These 
parameters are tuned carefully to ensure Lorentz invariance is restored in long-distance physics. If these parameters 
are determined using a perturbative expansion, then at the tree level they are 7/ = 7g = ^- Since practical simulations 
will be performed close to the continuum limit, this relationship should hold to within approximately 25%. Following 
the notation of [J , we will denote the tuned values of these parameters as 7* and respectively. 

With these definitions, the fermion matrix is defined as the sum of the following terms: 

M = A + fiS,,y- Dt- —Ds, (2) 
7/ 

with 



= E C-Y^U,i-K+ly + ^-^Ulix - l)sA , (4) 
i=i ^ ^ 

where 7^ for i = 0, 1, 2, 3 are appropriate spin matrices and fi is an anisotropic mass term corresponding to bare mass 
Too defined as fj, = atrriQ + 1 + 

In the case of Wilson fermions, the term A ~ Q while for the SW improved fermion matrix the mass term ^ and 
the hopping terms Dg and Dt are unchanged while the corresponding expression for A is 

A = -5x,y I ^ ^ (JmFio{x) + ^ E ^^^^^^ f ' 

i=l ij J 

where cry = ^[7i,7j] is a commutator of the spin matrices and Fij{x) is the anti-hermitian "clover-leaf" term 
constructed from the plaquettes in the i,j plane emanating from from site x as defined in Q and elsewhere. 
The tree level tadpole improvement coefficients and Ct are defined as 



3.,_' ^t = :^{^ + -], (6) 



where Us is the spatial tadpole factor computed from the spatial plaquettes of our gauge ensembles Uss as 

Us^{lUss)K (7) 



and we have set = 1. The isotropic case is recovered when one sets = Wt, 7g = 7/ = 1 and so Cs = ct are the 
same tree level improved tadpole coefficient. 
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As in the isotropic case, these linear operators act on vectors in an Nc x iV-^ x iVs-dimensional space of complex 
fermion fields, where is the number of colors in the gauge group, = 4 is the number of components in a spin-1/2 
representation of the group of four-dimensional Euclidean rotations and A^^ the number of sites on the four-dimensional 
lattice. 

The additional terms in the SW action are all diagonal in the spatial and temporal lattice co-ordinates. With the 
matrix expressed in this form, the largest terms in M are contained in A, Dt and fi, and the contribution from terms 
in Dg is smaller by a factor of the (bare fermion) anisotropy. This suggests an efficient preconditioning for this matrix 
can be constructed if a way of applying the inverse of A + — is known. Notice this matrix is block-diagonal in the 
spatial lattice co-ordinates, so solving this problem requires being able to invert a one-dimensional lattice operator. 



A. A temporal preconditioner 

We now proceed to present the technique of the temporal preconditioning. In the discussion below, we will initially 
focus on the Wilson case {A — 0) for simplicity. We will then comment upon the strategies available to deal with the 
full SW case. We begin by defining temporal spin-projection operators 

^± = ^(l±7o), (8) 

and a temporal hopping matrix at every spatial site on the lattice 

Tt,t'{x) = A"5t,t' - Uo{x, t)5t+i,t', t - ... Aft - 1, (9) 

where Nt is the number of lattice points in the temporal direction of the lattice. The operator T is gauge covariant 
with no spin structure. Further, T can have a variety of boundary conditions. We will consider primarily the cases 
where T is either periodic or anti-periodic where 

TNt-i,Nt~i{x) = (10) 
To,jv.-i(.t) = ±Uo{x,Nt-l), (11) 

and the sign in Eqn.[TT]is positive or negative when the boundary conditions are respectively periodic or anti-periodic. 
We now define operators on all spin components, 

C^^ =P+ + P-T and = P_ + P+T\ (12) 

where the invertability of these matrices has been assumed. With these definitions, the temporal hopping term can 
be expressed as 

M - A = . (13) 

Constructing CJ^^ and C^^^ with the spin-projector structure given above ensures they obey the relation 

CV = 75C«S5, (14) 

which will allow the construction of a preconditioned matrix that maintains 75 hermiticity. Eqn. [T2] now shows how 
Cl and Cr will make a useful preconditioner for the Wilson fermion matrix on the anisotropic lattice. The fermion 
matrix can be written as 

M = Ci'ClMCrC^^ = Cl'MC],\ (15) 
with M = C^MCj^. For Wilson fermions, 

M ^ I - —ClDsCr, (16) 
7/ 

and the preconditioned matrix is equal to the identity plus terms proportional to I/7/ only. 

The operation of M on a fermion field requires the operation of Cl = {P+ + P-T)~^. Since P+ and P- define 
orthogonal projectors, this inverse can be re- written as Cl = P+ + P-T^^ and the application of Cl is reduced to 
finding the inverse of T. If the lattice fields had Dirichlct boundary conditions, T (a lattice representation of the 
forward-difference operator) could be inverted easily by back-substitution, starting at the open boundary. Most lattice 
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calculations use periodic or anti-periodic boundary conditions however and so back-substitution is insufficient. The 
Sherman-Morrison- Woodbury (SMW) [1, [1, [l^l formula provides a simple means of inverting a matrix that differs 
in a small number of elements from another matrix whose inverse is computationally cheap to apply. The forward 
difference operator with periodic boundary data, T can be written in terms of the difference operator with open 
boundaries, Tg and a correction term: 



with 



T = Ta + VW\ (17) 

irj. 1 / ^^5t.t' - Uq{x, t)5t+i,t' if t = 0..Nt-2 , . 

iTokt'^'^)~[^,St,t' if ^^^^ 

and V and W are NcNt x Nc column matrices where the only non-zero entries are in either the first or last sites: 

Vt{x) = -Uoix,t)St^N,-i, (19) 

and 

Wtix)=St,o- (20) 

Since T, Tq, V and X are defined on each spatial site x, we will suppress the spatial index in subsequent discussion 
except where it may be needed for clarity. With these definitions, VW^ is a rank Nc correction that adds the effects 
of the boundary condition back in to the open-boundary-data operator. The SMW formula then gives an expression 
for the inverse of T defined in Eqn. [HI Defining X = Tg" V yields 

T-^ =Tf^^ - X{I + W'<X)-'^W''. (21) 

All that remains to be evaluated is (/ -I- W'^X)^^ but note that this is a small (rank Nc) matrix at each spatial site 
whose inverse is straightforwardly computed. In practice, the algorithm proceeds as follows: 

1. Prior to use, the preconditioner is initialized. On each spatial site x, the expression 

(a) X ~ Tf^^V is computed by back-substitution and 

(b) A = (/ -f- W^X)'^ is subsequently computed. 

All these results arc stored. A requires storage of just Nc x Nc complex numbers per spatial site, while X 
requires Nc x NcNt. This is smaller storage requirement than a single fcrmion field. Also, we note that one can 
immediately compute AA at this point which also requires storage of Nc x NcNt per spatial site but which may 
overwrite the original X . 

2. Given a particular right-hand side 77, computing ifj = T^^rj requires first evaluating 

(a) X ~ Tq^V by back-substitution, then 

(b) q — AW'^x- Note that W^x is just the A^'c-component vector on time-slice t = of vector x and so 
evaluation of W'^x is computationally trivial. Finally, the solution is formed: 

(c) ^^x-Xq 

The back substitution process for X can formally be carried out analytically. For the case of T one has: 

Xix,Nt~l) = --Ut{x,Nt^l), (22) 



^ Nt-l 

X{x,t) = — J] (7t(f,j)forO<i< At-1, (23) 

^ j=Nt-t 

and successive terms in X are suppressed by powers of the mass term /x, and the matrix product forms a series which for 
A (a?, 0) is the Polyakov loop. In principle; for large enough Nt \ one could ffiid some k such that for <t < k one has 
A(x, i) « numerically, and one may then save some numerical effort by just setting those values of X{x, i) = and 
not evaluating matrix products using them but setting them to zero also. Since the link matrices in the components 
of X are SU{?>) one knows that their product is also SU{'i) and hence one can find the norm of X{x,i) as 

||A(f,t)||=V3Ai-(^*-*), (24) 
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where the factor of -s/S comes from the SU{3) nature of the link matrices. We will refer to the cutting off the 
computation of X(x, t) for sufficiently small values of t as the cutoff trick. Caution should be used however to ensure 
there is no impact in the precision of final result. 

The inverse of T\ required for the operation of Cr is formed in the same way, using appropriate redefinitions of 
X, W and V and with forward-substitution solves. We note that in step 1(b) above, only an Nc x Nc complex matrix 
needs to be inverted per site. The cutoff trick also works, but now the terms are least suppressed at i = (open 
end of the forward substitution) and most suppressed at i = iVt — 1, building up to the Hermitean conjugate of the 
Polyakov loop for X{x, Nt — 1), and the Nt~t term in eqn. [M] needs to be replaced with t. 

Let us now comment on the case for SW fcrmions. Proceeding as above, the entire procedure is valid, but the 
preconditioned matrix changes to: 

M = ClMCb, = I + Cl(^A- ^D^j Cr . (25) 

It then becomes tempting to extend the definition of C^^ and Cf,^ in such a way that 

A + ^ji-Dt = Cl^C„\ (26) 

so that the preconditioned matrix would maintain its original / — -^ClDsCr form, even in the case of a general, 
non-zero SW term, i.e. A ^ Q. The practical difficulty with this approach is that the ct^^ terms in A couple all spin 
components and the forward and backward difference operators in Dt cease to be directly separable. Correspondingly 
the construction of suitable Cl and Cr terms would require the inversion of a block- tridiagonal matrix, with N^Nc x 
N^Nc sized blocks rather than just Nc x Nc- Further, the back/forward substitutions fill-in these blocks destroying 
the site wise block-diagonal structure of the SW matrix and so. the inversions of the diagonal blocks would need dense 
inversions of the full N^Nc dimensional sub-blocks. We will refer to this approach as full SW temporal preconditioning. 
The details of this approach arc discussed in the appendix. 

Nonetheless, even if one just uses the same Cl and Cr as for the Wilson action and suffers the contamination from 
the SW term in the preconditioned matrix M in eq. (P5)) . it can be seen that the Dt term is still inverted, and that 
the Ds terms are suppressed by a factor of . Hence one can expect that this form of preconditioning is still more 
effective than using the unpreconditioned operator. In what follows we will refer to this approach of using the Cl 
and Cr preconditioners from the case of the Wilson fermion matrix to prcconditione the SW operator, as partial SW 
temporal preconditioning. 



B. Combining the temporal preconditioner with other schemes 

The usual isotropic Wilson and SW operators are efficiently preconditioned by considering a Schur decomposition 
after first ordering lattice sites according to their four-dimensional co-ordinate parity, P4{x) = (—l'^^o+xi+x2+x3^ 
has 

with M"^'-""^ = + ^, and Ar° = -D^", = + Dt being the 4-dimensional Wilson Dslash operator, 

and the Schur preconditioned matrix is 

This kind of Schur preconditioning with a p4 ordering, is the standard even-odd preconditioning method in use for 
Wilson and SW fermions, and we shall refer to it as 4D-Schur preconditioning from now on. 

This idea can be combined with the temporal preconditioner with a small modification; instead of ordering lattice 
sites by a four-dimensional parity, a three-dimensional equivalent is used, psix) = (— l)^i+^2-l-a;3 ^ j'i^q preconditioned 
matrix M retains its form in terms of Af'^'^(°°) and M'^°(°'^\ however these now change their meaning slightly, since 
with this ordering, the operators A and Dt connect lattice sites with the same while the spatial hopping matrix 
couples sites with opposite p^. and 

pjee{oo) _ ^ee(oo) ^ ^ _ jj<^e{oo) ^ j^eo{oe) _ _ _}_ j^eo{oe) ^ |'2gj 

7/ ' ' 
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For the Wilson action, the inverse of the block matrix, Af = — D^'^ is formed using the method described 
in the previous section. For full temporal preconditioning in the clover case, one would need to form the inverse of 
Af — A'^'^ + /i — Df'^ the difficulties with which have already been discussed, and we can refer to the result as temporal 
preconditioning combined with 3D-Schur preconditioning. Alternatively, one can proceed with partial preconditioning 
for the Clover case, using the p^ ordering preconditioner appropriate for the Wilson action. In this case we can refer 
to the result as temporal preconditioning combined with 3D incomplete lower-upper (ILU) preconditioning. 

First, we define the action of the left temporal preconditioner on the even 3-parity sub-lattice to be C£ and define 
correspondingly the right preconditioner and both their odd sub-lattice counterparts to be C|j,,C£,C^. We also 
introduce the notation that for some generic term K the corresponding term K is defined as K ~ ClKCr. Hence, 

Df = ClDTC-k. (30) 

To keep the discussion below general, we will also define the operator 

Q = CL{A + ^i-Dt)CR. (31) 

In the case of Wilson fermions, ^ = and so Q = /. For SW fermions with full SW preconditioning one also has 
(5 = 1 while with partial SW preconditioning. 

0=1 + ClACr = 1 + A. (32) 

Q is diagonal in terms of p^ even-odd indices, and we may refer to its even-even (odd-odd) sub blocks Q'^'^'^°°^ using 
(j<o)^ (^e(o) g^^^ j^ee(oo) needed. 

With these expressions, the even-odd preconditioncrs for the Wilson matrix become 

^.^(ifllc£cO"^*''-(?*1''n. (33) 



which gives 



Ms = SlMSr = ( _ J_5oe [1 _ geej Qoo _ ^^^oe ^3 _ Q^ej Qeo 1 • (34) 

If Q = 1, as is the case of Wilson fermions or Clover fermions with full temporal preconditioning, this matrix 
reduces to 

A/3 - I Q ^ _ -^Df Lif I ' (35) 

so that the preconditioned matrix differs from the identity only by terms proportional to In the case of partial 

''f 



SW preconditioning, where Q = 1 + A, the preconditioned matrix is 

1 + A"" -^A'^'D 



eo 



- 1^ J_£)oe^ee 1 + ^oo _ Y ^oe _ ^eej j^eo j ■ (36) 

We can see from eq. [5^1 that in contrast to full preconditioning in eq. [35l we now have non-zero off diagonal elements 
(in even-odd space) that are only suppressed by 7/. Further the diagonal elements contain components proportional 
to at in the Clover terms A. These terms can counter potential suppression by j'j in the odd-odd checkerboardcd 

term of Af 3 . 

To complete this discussion, we note that effective use of the preconditioner requires the inverses S^^ and 5*^^ 
to be applied so the solutions with the unpreconditioned matrices may be obtained. Using the definitions of Cj^^j^^^ 
restricted to the even and odd sites respectively we have 
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C. Numerical cost of ILU Scheme 



We consider the partial temporally preconditioned scheme, combined with ILU even-odd preconditioning to be 
potentially the most attractive, since it is simple in terms of implementation and is equivalent to the full 3D Schur 
preconditioned scheme in the case of Wilson fermions. However, applying the preconditioners does incur some nu- 
merical overhead. The overhead depends to some degree on details of the implementation of the method. We will 
consider two implementations below. 

First we consider the naive implementation of the method, with no gauge fixing and in spinor basis where 70 is 
not diagonal. This could be the case in a general code, using a chiral spin-basis such as the Chroma software system 
PH . Neglecting the cost of preparing the sources and recovering the solutions (using S]^^ and S]^^) we can compare 
costs of the usual 4D Schur preconditioning and the partially temporally preconditioned ILU scheme, which we will 
denote as C(M4DSchur) and C(Milu) respectively, by counting the floating point operations (FLOPs) in the respective 
preconditioned linear operators. 

Relegating the actual counting of FLOPs to appendix fSl we merely state here that the ratio of floating point costs: 

^^^(MiLu^^ (38) 

C(M4DSchur) 

is 

^ 3432iV,- 576 ^ 0.216 

^ - 26687V, -1-286-^, (39) 



for Wilson fermions, and 



6012Ar,- 1152 ^ 009 

3732iVt Nt ^ ^ 



for Clover fermions respectively, resulting typically in about a 29% overhead for Wilson, and 61% overhead for Clover 
from the ILU scheme in terms of FLOPs as compared to the standard 4D Schur even-odd scheme. This must be 
matched by the gain in terms of condition number from the preconditioner for it to remain competitive. 

Concurrent with writing this paper, some clever optimization techniques were brought to our attention by the 
authors of [13, Ell arising from work with General Purpose Graphics Processing Units (GPGPUs). These techniques 
save both memory bandwidth and FLOPs. The first technique we consider is to fix the gauge prior to the inversion 
proess, to the Axial gauge (temporal gauge). The effect of this operation is that all the links that are not on the 
temporal boundary are transformed to the unit matrix: Uo{x, t) = 1 for t = . . . A^i — 2. This can save fioating point 
operations in the back(forward) substitutions with To, where in each step one can save an SU(3) matrix-color vector 
multiply. Further, one can save on memory requirements since the block vector X is simplified to 

Xix,t)^^^Uo{x,Nt-l), (41) 

with the other terms in the partial Polyakov loops now being the identity. Correspondingly, instead of storing all 
of X, one can easily compute any component of it from the boundary link matrix which one stores anyway. Fixing 
to the axial gauge is a straightforward operation which can be amortized over either one and especially over several 
solves. 

The second trick that can prove useful is to employ the Dirac-Pauli spin basis in which 70 is diagonal. This 
simplifies the projector operators P-f and so that they select the top or bottom two spin components of four 
spinors respectively. This can save FLOPs on spinor reconstuction (which now no longer needs to be done in the time 
direction) but also when one has a sum of the form 

S = P+i' + P_x, (42) 

one has no arithmetic to perform, since the spin components filtered by the projectors can be written directly into 
the correct components of S without requiring any addition. We enumerate in explicit detail the savings from these 
two implementation techniques in appendix [21 It is shown there that employing both of these techniques can save 
roughly 22-25% in terms of FLOPs over the naive implementation. 

When compared to the 4D-Schur preconditioned scheme, which also benefits from these improvements, we find that 
the techniques result in relative overhead ratios of: 



(43) 
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for Wilson fcrmions, and 



4476iV. - 96 0031 

31087Vt Nt ' ^ ^ 



for Clover fermions. In particular, the relative overhead for the Clover operator appears to be substantially reduced 
compared to the naive implementation (44% as opposed to the previous 61%). The foregoing discussion does not make 
use of the cutoff trick, which can be used to further reduce the floating point costs of the preconditioned operators as 
discussed earlier. 



III. NUMERICAL INVESTIGATION 



A. Strategy and Choice of Parameters 



Wishing to investigate the efficacy of the preconditioning strategies as functions of both lattice anisotropy and quark 
mass in as realistic a setting as possible, we have opted to measure the condition numbers of the various operators 
in three quenched ensembles. These ensembles were chosen to have target (renormalized) anisotropies of ^ = 1.0, 
^ = 3.0 and ^ = 6.0 respectively, thus ranging from the fully isotropic to the highly anisotropic. We picked a wide 
range of quark masses, to give pion masses in the range of about 450 MeV to 750 MeV. We note that this necessitated 
us tuning the fermion anisotropies 7/ so as to make the renormalized anisotropies ^ (as fixed by the pion dispersion 
relation) the same as our target anisotropies ^, a subject wc will discuss in more detail further on. 

During our study, we have opted to keep the physical temporal extent of the lattice fixed in the time direction. This 
approach means that as we increased the anisotropy, wc likewise increased the number of lattice points in the time 
direction. We have used iV^ = 16, A't = 48, A^t = 96 for the anisotropies of ^ = 1.0, ^ = 3 and ^ = 6 respectively. This 
increase of the temporal resolution may have an effect on the condition numbers of our operators. We felt however, 
that this is typically the approach one would use in a real calculation, rather than keeping the temporal extent fixed, 
and that we should absorb this effect in our efficiency estimates, in order to give a realistic measure of the performance 
of the preconditioning. 



B. Code and Computers 



We coded the temporal preconditioned Wilson-Clover operators, in the Chroma [Tl| software system. We imple- 
mented both 3D ILU and 3D Schur even-odd preconditionings in space, in combination with the temporal precondi- 
tioning. In the case of the 3D Schur even-odd preconditioning, we used an inner Conjugate Gradients solve, to invert 
M^*^ in the Schur complement. We also used the unpreconditioned and 4D Schur Even-Odd preconditioned operators 
already present in the Chroma suite to measure reference results. Our implementations used the naive implementation 
technique discussed in sec. Ill CI In order to tune the fermion anisotropics, we carried out some hadron spectroscopy 
calculations, in particular the measurement of the pseudoscalar correlation functions at various momenta. In order to 
measure the condition numbers of the square operators, we used the Ritz- minimization technique of p^ . Both these 
sets of measurements are standard within the Chroma distribution. Fitting of our spectroscopy results used the so 
called 411 code developed by UKQCD. Our calculations were carried out on the Jefferson Lab 6n and 7n clusters. 



C. Selecting the Gauge Action Parameters 



We chose our isotropic reference case, to be a quenched dataset with the Wilson gauge action at /3 = 6.0 as it is well 
known in the literature to have a lattice spacing of O.lfm |l5l|. In the anisotropic case using /3 = 6.1 has approximately 
the same spatial lattice spacing at ^ = 3 and ^ = 6 [la]. We chose the bare gauge anisotropies from the formula 
suggested by [l^. Our gauge production parameters are summarized in Table [H 



D. Tuning the Fermion Parameters 



In this study, wc have opted to use Clover fermions, with tree-level tadpole improved clover coefficients as defined 
in Ref. [6l In the anisotropic cases, wc needed to tune the fermion anisotropy as well as our quark masses to fall within 
our desired range. 
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/3 


V (N^ X Nt) 


79 


Us 


1 


6.0 


16^ X 16 


1 


0.8780 


1 


6.0 


16=^ X 48(t) 


1 


0.8780 


3 


6.1 


16^ X 48 


2.464 


0.8279 


6 


6.1 


16^ X 96 


4.7172 


0.8195 



TABLE I: Parameters of the Quenched Ensembles used in this study. In the first column ^'^ refers to the desired target 
anisotropies, with the final bare gauge anisotropies 7* being shown in the 4th column. Apart from (f) the temporal extent 
grows from Nt = 16 with the anisotropy. (f) was used merely for checking the pion mass at the isotropic parameter set since 
the A^t = 16 case was too short for measuring the pion mass. We also show our measurements for the spatial tadpole coefficient 

Us- 

Our tuning exercise then comprised of choosing trial values, of 7/ for a selection of values for mg, and computing 
the pion dispersion relations for each {mQ,"ff) pair. Since anisotropic tuning is not the main subject of our paper, we 
were content to do this very roughly and were satisfied by a renormalized anisotropy within 10% of our target. To 
compute the dispersion relation, we extracted the ground state energies (Eo) of the pion for several initial momenta 
p, by fitting the pion correlation function: 

Cit,p) = e'^-P{OHx, t)0{Q, 0)) 2Ae-^«^ cosh {e^{^ - A , (45) 

X 

where O is a suitable pion interpolating operator: 

0{x, t) = i){x, t)T^{x, t). (46) 

We used spatial momenta ranging in magnitude from \fp'\ = to ~ (^ttIv") ■ averaged the correlation 

function over the momenta that resulted in equal values of \p^\. Wc used the interpolating operator F = 75 for the 
zero momentum fits, whereas for the finite momentum fits we used F = 7475 as we found the signal to be cleaner. 
Once £^0 was determined for all values of we fitted them to the dispersion relation formula: 

E^ = ^p^ + m^, (47) 

where rh — E'odpl^ — 0), by a straight line fit, to extract ^. To compute our estimate for the pion mass, wc then 
computed the mass in units of the spatial lattice spacing: asWiatt = S.'ni, and then converted this number to physical 
units assuming as = O.lfm. 

The correlation functions themselves were constructed using gaussian gauge invariant source smearing p^ . and 
stout link smearing [l^. No smearing was performed at the sink. Our tuning calculations were carried out using 
40-100 configurations for each (mo, 7/) pair. Wc used the bootstrap method to estimate our errors on the masses and 
fitted anisotropies with 200 bootstrap samples in each case. When estimating the physical pion mass, we added the 
bootstrap errors on ^ and rh in quadrature, rather than under the bootstrap. The results of our tuning are shown in 
table mi wherein we show the fitted fermion anisotropies, and our estimates of the mass of the pion. 

E. Results 

With the tuned fermion parameters in hand, we computed the condition numbers for the various kinds of precon- 
ditionings in our three quenched ensembles. We did not make use of the cutoff trick in our temporally preconditioned 
operators. This time in the isotropic case, we used the Nt — 16 ensemble, since we did not need long time extents for 
fitting. Specifically we computed condition numbers for the unpreconditioned (Unprec), and the 4D Schur even-odd 
preconditioned operator (4D Schur) to use as a standard to compare against, as well as the temporally preconditioned 
operators combined with both 3D ILU even-odd preconditioning (TPrec-|-ILU) and Schur style preconditioning in 
3-dimensions (Tprec-|-3D Schur). We used 19 configurations from each ensemble to measure the condition numbers. 

In figure [1] we show how the condition number of the unpreconditioned operator varies with pion (quark) mass and 
anisotropy. We can see, as one would expect, that the condition numbers increase with decreasing pion (quark) mass 
as well as with increasing anisotropy. In figure [5] we plot the ratios of the condition numbers of the preconditioned 



10 



e 


mo 


V 




E{p = 0) 


m-n (MeV) 


# configs used 




-0.359 


1 


1 


0.383(3) 


766(6) 


100 


1 


-0.379 


1 


1 


0.296(2) 


597(4) 


100 




-0.392 


1 


1 


0.225(2) 


450(4) 


59 


3 


-0.13 


2.95 


3.06(4) 


0.105(2) 


641(15) 


46 


3 


-0.132 


2.96 


3.08(4) 


0.097(1) 


597(12) 


46 


3 


-0.135 


2.96 


3.03(4) 


0.082(2) 


498(15) 


46 


6 


-0.058 


5.43 


5.99(10) 


0.0578(9) 


693(16) 


50 


6 


-0.061 


5.63 


5.96(10) 


0.042(1) 


504(17) 


41 



TABLE II: At each of our target anisotropies we show the bare values of mo, our tuned bare fermion anisotropy 7J, and 
the resulting measured renormalized anisotropies ^ as determined from the pion dispersion relation. We also show the resulting 
pion masses in lattice units {E{p = 0)) and in physical units (m.^-). The ^ = 1.0 results were determined on the 16^ x 48 lattice. 
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FIG. 1: The condition numbers of the unpreconditioned clover operator against the measured pion mass, for all three of our 
anisotropy values: ^ = 1.0 (black circles), ^ = 3.0 (red squares) and ^ = 6.0 (blue diamonds). 
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FIG. 2: Ratios of the condition numbers of the preconditioned operators to the unpreconditioned operator at the same mass and 
anisotropy (lower is better conditioned). The 3 panes are for anisotropies of ^ = 1 (left), f = 3 (middle), ^ = 6 (right). The colors 
are 4D Schur preconditioning (green triangles), temporal preconditioning -I- 3D ILU (black circles), temporal preconditioning 
-|- 3D Schur preconditioning (red squares) 
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operators to the condition number of the unpreconditioned operator. We separate the results into three graphs for 
the three values of ^ used. Separation in terms of 7^ is not practical since for different masses, one can have different 
values of 7/ for the same renormalized anisotropy ^. The ratios give a nice clean signal, as it appears that from 
configuration to configuration, the ratios do not fluctuate very much (although the condition numbers themselves do). 
Hence the average of the ratios is very stable. 

First we can see in the leftmost graph of fig. [21 that in the isotropic case, the greatest gain comes from the 4D 
Schur preconditioning, whose condition number is about 15% of the unpreconditioned (Unprec.) case. Some gain 
over the unpreconditioned case can still be achieved using temporal preconditioning combined with cither 3D Schur 
(TPrec + 3D Schur) or partial temporal preconditioning with ILU in 3D (TPrec + ILU). However the temporally 
preconditioned cases are not as efficacious as the 4D Schur preconditioning in the isotropic case. Looking at the 
middle and rightmost graphs of fig. [5] we see that in the anisotropic cases, the temporally preconditioned operators 
fare much better, with condition numbers that are around 4% - 6% of the upreconditioned one. 

We note that the mass dependence in these ratios appears to be very mild, for a given value of ^. Also with the 
temporal preconditioning we see an improvement in the condition number ratios as ^ is increased. This is presumably 
due to the suppression factors of ^ and ^ in the temporally preconditioned operators. We expect this is because 

of the clover term in the preconditioner which has components in both the spatial and temporal directions, and the 
temporal components will counteract the suppression factors 7/ and 7^ to some degree. It is however encouraging to 
see that the partial preconditioning combined with 3D-ILU even-odd preconditioning in space is nearly as good as 
the 3D Schur preconditioned case. This suggests that not dealing with the clover term in the preconditioner is not 
in fact catastrophic. This is welcome news as the 3D-ILU preconditioning is considerably simpler to implement and 
involves fewer FLOPs than the fully preconditioned 3D-Schur approach. 

We replot the condition number data in figure [31 this time plotting the ratio of the condition number of the 4D 
Schur operator to those of the temporally preconditioned operators, to see if we gain in condition with respect to the 
"standard preconditioning" . The ratios are are defined as 

^4D Schur ^4D Schur ^^43^ 

'^TPrec.+ILu' «TPrcc.+3D Schur 

and hence values larger than one indicates that the temporally preconditioned operator is better conditioned than 
the Schur4D, whereas values less than one indicate that the Schur4D is better conditioned. 

We see, as before, in the leftmost graph of fig. [H that in the isotropic case, neither of the temporal preconditioned 
approaches can beat the 4D Schur preconditioned approach. The condition numbers of the temporally preconditiond 
operators are about 4 times the Schur 4D case. Looking at the middle and rightmost graphs, one sees that the 
TPrec. -I-ILU preconditinong gives a decrease in the condition number of about a factor 2.8-3.8 (depending on ^), 
while the TPrec -I-3D Schur preconditioning gives a decrease of a factor of about 3.3-4.4 compared to the standard 4D 
Schur even odd preconditioning (again depending on ^). 

At this point; we should recall that the numerical overhead of the ILU style preconditioning is between 44%-61% 
depending on implementation. Using the most conservative overhead of 61%, this gives an overall gain in total cost 
for this scheme, compared to the 4D Schur case, of 2.8/1.61 — 3.8/1.61 which equates to between 1.74 — 2.36 for 
anisotropics of ^ = 3 — 6. With the most optimistic overhead estimate (44%) these gains can grow to 1.94 — 2.64. 



IV. CONCLUSIONS 



We have demonstrated the technique of temporal preconditioning for anisotropic discretizations of the Wilson and 
SW Dirac operator in lattice QCD. We discussed the implementation of the technique in an efhcient manner, and 
its combination with further even-odd preconditioning techniques, in particular the 3D-ILU and 3D Schur even-odd 
approaches. 

In the case of Wilson fermions, the 3D-ILU and 3D-Schur approaches are in fact identical and we have shown that the 
only terms in the preconditioned matrix that differ from the identity are suppressed by two powers of the anisotropy. 
For SW fermions, the presence of the clover term complicates the implementation since there arc explicit terms 
that are diagonal in spatial indices and that connect the null spaces of the forward and backward projectors. The 
3D-ILU approach remains straightforward, however, the resulting preconditioned matrix has non-zero off diagonal 
elements, which are only suppressed by factors of ^ rather than 

We implemented the method in Chroma, a general code base for lattice QCD. We have also considered optimization 
techniques suggested from work in the realm of GPGPUs. The appendix shows that the number of floating point 
operations in the temporally preconditioned operator with ILU even-odd preconditioning is about 30% and 61% higher 
than the corresponding 4D Schur preconditioned operator for the Wilson and SW operators respectively in the most 
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FIG. 3: Ratios of the condition numbers of the 4D Schur even-odd preconditioned operator to the two temporally preconditioned 
cases (higher is better conditioned). The 3 panes are for anisotropics of ^ = 1 (left), ^ = 3 (middle), ^ = 6 (right). The colors 
are 4D Schur/TPrec + ILU (black circles), and 4D Schur/TPrec + 3D Schur (red squares) 

generic case. Specialized implementation can reduce floating point the overbad of the Clover action to about 44%. 
These costs may further be ameliorated as needed through judicious use of the cutoff trick. We have not used the 
cutoff trick in the numerical results of this paper. 

We have carried out numerical tests in a variety of quenched ensembles, both isotropic and anisotropic to investigate 
the efficacy of the techniques discussed, using Wilson-Clover fcrmions at a range of quark masses. 

Our chief conclusion is that the technique works well for anisotropic cases. In our studies, with anisotropics of 
^ = 3 and ^ = 6, the temporally preconditioned Clover operator with ILU preconditioning had condition numbers 
between a factor of about 2.8-3.3 times smaller than the usual 4D Schur preconditioned case, whereas the temporally 
preconditioned operator with 3D Schur even-odd preconditioning had condition numbers that were about 3.3-4.4 
times smaller than the usual 4D Schur preconditioned case. Combined with the various overhead estimates, this gives 
the temporally preconditioned, ILU preconditioned clover operator a cost advantage of a factor of 1.74 — 2.36 over 
the more standard 4D Schur preconditioned approach depending on implementation. The 4D Schur preconditioner, 
however, appeared to be the best conditioned in the isotropic case. 

Due to the decreases in condition number observed using the temporal preconditioned methods, it becomes attractive 
to extend the preconditioning scheme to Hybrid Molecular Dynamics-Monte Carlo algorithm such as Hybrid Monte 
Carlo (igj . New terms will arise in the Hamiltonian, due to the determinants of the preconditioning matrices. We 
leave the full discussion of these ramifications to a future publication. 
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1. Appendix — Full Preconditioning with Clover 

In this appendix, wc consider the question of how to deal with the inversion of the term {A + ii ^ Dt), in the case 
of full temporal preconditioning. 
We begin with 



A{x, t) + fi- Dt{x, t) = A{x, t)+fi- P-Ut{x, t)Ss,t+i-s.t' - P+Ul{x, t - l)5s.t-i,x,t 
Inverting A + — Dt can be done with a single step Woodbury procedure: 

A + fi- Dt = To + VW^ 

where 

/ -W{x,Nt~l)P+ \ 





V 







, W^t = (p_^o,0,0,...P+) 



V -U{x,Nt^l)P- I 
With the tridiagonal matrix Tq is now, supressing space indices: 

/ A(0)+M -C/(0)P- 
-C/t(0)P+ ^(1) + Ai -V{^)P- 



Tn = 







V 







-UHNt~5)P+ A{Nt^2) + ^i -f/(iVt-2)P_ 
: -WiNt -2)P+ A{Nt-l) + fi I 



(49) 
(50) 



(51) 



(52) 



This matrix, while easy to apply, is not immediately straighforward to invert, because of its projector structure. In 
our numerical work, to gauge the efficacy of this approach, we used an inner conjugate gradients algorithm to invert 
this matrix. 



2. Floating Point Operation count for ILU Scheme 



Let us recount the the number of floating point operations needed to apply the usual 4D Schur preconditioned 
operator: 



M4DSchur = - ii^- {A-^y- 



(53) 
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where the mass term has been absorbed into the diagonal part of A and D°'^ denotes the 4D Hopping matrix. The 
parameters 7/ have been absorbed into a rescahng of the gauge hnks. We wiU use the notation C{Q) do denote the 
floating point cost of some generic term Q. Applying the A°° term and A'^^ both require 522 FLOPs per site, on V/2 
sites (single checkerboard) each, giving C(A°°) ^CiA"") = 522(F/2) FLOPs 

The D°'^ and terms require 1320 FLOPs each per site on V/2 sites (single checkerboard) each. This number is 
arrived at by considering the spin projection operators as having no floating point operations, since only sign flips and 
exchanges of real and imaginary components are involved. Each SU (3) matrix- vector (matvcc) operation requires 66 
FLOPs, corresponding to three inner products between the three rows of the matrix, and the column vector. Each 
inner product involves 3 complex multiplies (6 FLOPs each) and 2 complex adds (2 FLOPs each), or 3x6 + 2x2 = 22 
FLOPs, giving a total of 3 x 22 = 66 for the three inner products in the complete matvec operation. 

Spin reconstuction (recons) takes 12 FLOPs per site, coming from a single complex add for each of 2 spin-color 
components (6 complex-adds in total); again, not counting sign flips and real complex component interchanges. Finally, 
24 FLOPs are required to sum two (now reconstructed) 4 spinors (sumvec4) to evaluate the sum over directions (4 
spin-color components, so 12 complex components in total, with 2 FLOPs per component). 

In Nd dimensions 



C(D°^) = 2Ar^(iV* matvec -I- recons) + sumvec4 x {2Nd - 1) (54) 

where the factor of 2Nd comes from doing forward and backward projections and reconstructions in Nd dimensions, 
and N* is the number of spin components left after spin projection. So, in 4-dimensions, with = 4 we have 
iVd = 4, TV* = 2 and one has: 

C{D'"') = 8(2 X 66 + 12) + 24 X 7 = 1320 FLOPs (55) 

Correspondingly, per site the 4D Schur Operator takes up 

C(M4DSchur) = 2 X (1320 + 522) + 48 = 3732 FLOPs (56) 

where the last factor of 48 comes from the AXPY operation to apply the factor of j and subtracting the two terms 
from each other. Hence, applying the operator costs 3732(^/2) FLOPs in total, which it is convenient to re-express as 
3732Nt{Vs/2), with Nt the extent of the time direction, and Vs being the number of spatial coordinates per timeslice. 
Let us now consider the ILU preconditioned operator 

M:.u=fl+^^^ - -1-1 (57) 

where we have absorbed the factors of ^ into the D terms. Applying A/jlu to some vector to result in x = TI/iluV' 
implies that 

Xe = i'e+A-^D'°^o + ^e) (58) 

Xo = + + {A^^ [D'^o + ^e] - D'°i^o] (59) 

and one can reuse several terms between Xe and Xo- Thus Milu can be efficiently applied by computing in sequence: 

Xe = V'e (60) 

Xo = ipo (61) 
h = D'^^o (62) 

t2 = A-(ti+Ve) (63) 
Xe = Xe+t2 (64) 

Xo = Xo + A^^^Po + D'>^t2 - h) (65) 

and apart from 5 vector additions, we need to apply A and D twice each (once per checkerboard). Since A = ClACr 
and D = ClDCr we have 

C(AfiLu) = 5 X 24 X (y/2) + 2 [C(L',) + C(i)] = 120(1^/2) + 2 [2C{Cl) + 2C{Cr) + C{A) + C{D,)] (66) 



where the 5 x 24 x {V/2) term is to account for the 5 vector adds, each on a single checkerboard. 



15 



We still have C{A) = 522 per site and by substituting Nd = 3 into the discussion for C{D) one finds that applications 
of the 3D Dg operator cost 984 FLOPs per site. The preconditioners Cl and Cn have the same cost each namely 
C(C) and so 

C(MiLu) = 120(y/2) + 2 [4C(C) +C{A)+ 0(0^)] = 120(V"/2) + 8C(C) + (2 x 522 + 2 x 984)(y/2) = 8C(C) + 3132(F/2) 

(67) 

Applying C requires a back (forward) subsitution for each spatial site of the appropriate checkerboard. We consider 
the backsubsitution here, but the working is similar (and the FLOP count is identical) for the forward substitution. 
The back subsitution needs to be performed on only 2 spin color components, after spin projections with either 1 +70 
or 1 — 7o for Cl or Cr, followed by an appropriate reconstruction. After the backsubsitution the Woodbury procedure 
involves for each spatial site, working with length Nt block vectors, a matrix multiplication by the a pre-computed 
SU{S) matrix, and an SU{3) subtraction. As usual we do not count any floating point operations for the projection 
part of the spin projection steps. This gives 

C{C) = n; [{Vs/2) {C{B) + aW))] + {V/2)(U + 48) (68) 

where Vg denotes the number of spatial sites, C{B) denotes the cost for the back substitution, C{W) denotes the rest 
of the Woodbury process, the 12(^/2) FLOPs come from the spin reconstruction on one chckcrboard and the 48(^/2) 
FLOPs come from adding the P+ and P_ terms in Cl and Cp ; one vector addition costing 24 FLOPs and an overall 
scaling by 5 to normalize the projectors costing another 24 FLOPs. 

We now need to consider the backsubstution on a single spin color component: The first step is just a scaling by the 
diagonal ipNt-i = jiXNt-i or 6 FLOPs. Then there follow A^t — 1 steps of = ^ [Xi ^ Ui^/ji+i], each one comprised 
of an SU{3) matrix vector multiply (66 FLOPs) and a subtraction and a scaling by ^ (6 FLOPs each). In total: 

C(B) = 6 + {Nt - 1) (66 + 6 + 6) = 78A^t - 72 FLOPs (69) 

The remainder of the Woodbury procedure involves computing the XAVF^ term. This can be achieved by pre- 
computing XAH^^ for each value of t at initialization, and then this process costs only Nt matrix vector operations 
(66FLOPs each), and finally we need to subtract the result of this matrix multiply from the result of the back/forward 
substitution (6 FLOPs) for each value of Nt and so 



and 



C(W) = (66 + 6)Nt = 72Nt FLOPs (70) 



C(C) = N*{V,/2)[78Nt- 72 + 72Nt] + {V/2)[12 + A8] 

= N*{Vs/2)[l50Nt-72]+60{V/2) 

= {Vs/2) {300Nt - 144) + 60Nt{Vs/2) 

= {Vs/2) {360Nt ~ 144) FLOPs 

where we have used VgVt = V, and A^* = 2 so 

C(A/iLu) = 8C{C) + 3132{V/ 2) 

= 8{Vs/2) {360Nt - 144) + 3132Nt{Vs/2) 
= {Vs/2) {2880Nt - 1152) + 3132Nt{Vs/2) 

= {6012Nt - 1152) {Vs/2) FLOPs (71) 
Comparing the costs of the two preconditioned operators wc have for Clover fermions 

^ ^ C(AfiLu) ^ 6012Af, ~ 1152 _ ^ _ 0^ ^^2) 
C(M4DSchur) 3732A^t ~ ■ Nt 

and so we consider two limiting cases: in the first instance when A^t = 1 we have R = 1.302 to 3 decimal places (d.p.) 
and when Nt is sufficiently large that the term involving it is negligible we have R = 1.611 to 3 d.p. In a typical case, 
when Nt = 128 - 256 one has i? w 1.61 to 2 d.p. 

Similarly wc can consider the case for just Wilson fermions. In this case 

M4DSchur = M - ^D°-D-" (73) 
4// 
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and the cost is 

C(M4DSchur) = (48 + 2 X 1320)iVt(ys/2) = 2mmt{Vs/2) FLOPs (74) 

where the 48FLOPS comes from the subtraction, and scahng (AXPY) operation. 
The temporaUy preconditioned scheme needs only the evaluation of 



MiLU = / - DTD'° (75) 



and so the cost in flops is: 



C(MiLu) - 2^{V/2)+AC{C) + 2C{D,) 

^ 2ANt{Vj2) + 4(K,/2) (360Nt - 144) + 2 x 9847Vt(y^/2) 

= (14/2) (24iVt + 1440iVt- 576 + 19687Vt) 

= (14/2) (3432iVt - 576) FLOPs (76) 



and we have 



i? =^1^^^. 1.286 - 0.216-1 (77) 
2668iVt Nt ^ ' 

giving R in the range of i? « 1.07 — 1.286. 

Use of the cutoff trick removes the multiplication by components of X and subtraction of the term Xq in steps 2(b) 
and 2(c) of the Sherman-Morrison- Woodbury process. If the cutoff value of t is k, one saves k timeslices; from t = 
to i = fc — 1; per spatial site in the ILU Clover operatorevaluting or (T^) , on each of N* = 2 spin components. 
On one spin component one saves 66-1-6 flops, and so per spatial site, one saves 2 x (66 -f- 6) = 144 FLOPs per per 
timeslice; altogether 144fc FLOPS. One does this every and C/j of which there are 4 in the Wilson case, and 8 in 
the ILU preconditioned Clover case, each of which is evaluated on a single checkerboard {Vs/2 sites). Correspondingly 
the cutoff trick on k timeshces saves 576A:(V;/2) FLOPs for the ILU Wilson and 1152fc(14/2) FLOPs for the ILU 
Clover operator. 



3. Dirac Basis and Axial Gauge 

The use of the Dirac basis and the Axial gauge was advocated in [l^, [l^l to save FLOPs and memory bandwidth 
in the implementations of the Dirac operator on GPU systems. The use of these techniques is also beneficial in the 
case of temporal preconditioning. We analyze below the benefits for temporal preconditioning in terms of the FLOPs 
savings, but note also, that savings in memory bandwidth resulting from the use of these techniques can also help the 
performance of the implementations. 

The use of the Dirac basis, where 70 is diagonal, can save the cost of spinor reconstruction in the time direction, 
saving some 48 FLOPs per site in the 4-dimensional D operators - 12 each in the forward and backwards time 
directions respectively from not having to reconstruct, and another 12 each when accumulating since now only half 
vectors need to be accumulated, rather than the reconstructed 4 vectors. This corresponds to a saving of 48Nt{Vs/2) 
FLOPs, per D operator, or a total of 96Nt{Vs/2) when considering the full 4D Schur preconditioned operator (where 
D is applied twice). 

Correspondingly, this trick can save the (y/2)(12 + 48) = 6QNt{Vs/2) FLOPs term from the cost of applying Cl 
and Cr since no spinor reconstruction is needed and one does not need to expend FLOPs when adding the results 
of the P+ and P_ projectors, as the projectors will simply write their results to different components of the final 4 
spinor. In the case of temporal preconditioned clover fcrmions, where the preconditioncr is used 8 times overall this 
results in a saving of A80Nt{Vs/2) FLOPs. In the case of unimproved fcrmions, where the preconditioncr is used only 
4 times one can save 2407Vt(T4/2) FLOPs. 

Use of the axial gauge allows one to save further FLOPs. In this case all the links in the temporal direction (apart 
from the boundary) are transformed to have value Ui = 1. In the case of the 4D D operator, this saves 2 SU(3) 
matrix vector multiplies per spin component, coming from the forward and backward temporal link matrices on the 
non-boundary sites. To make counting easier, we'll assume the number of boundary sites is negligible and assume 
the saving for every site. Hence we count 2A^* x 66 = 264 FLOPs saved per site. Two applications of D are needed 
for the 4D Schur preconditioned operators, so this trick saves roughly 264 FLOPs per site or 528Nt{Vs/2) FLOPs in 
total. 
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Correspondingly the Nt — I back(forward) substitution steps change from V'i = ^ [x ^ Uiipi+i] to merely tpi = 
i [x — i/ji+i] s-nd one saves Nt — 1 matrix multiplies in applying the preconditioner for each spin component, leading 
to a cost saving of 66N*{Nt — 1) = 132Nt — 132 flops per spatial site. 

In the case of Clover fermions, where 8 applications of the preconditioner needed, the saving is {1056Nt~ 1056)1^/2 
FLOPSs, whereas in the case of unimproved Wilson, where 4 applications arc needed the saving is (528iVt — 528)(Fs/2). 

Combining the savings from the use of the Dirac basis and the use of the axial gauge one gets, for the cost of the 
temporally preconditioned Clover Operator: 

C(MiLu) = (44767Vt - m){Vj2) (78) 

which is a saving of roughly 25% over the previous cost in eq. [TTJ The cost of the 4D Schur preconditioned operator 
also reduced: 

C(Af4DSchur) = 3108Nt{Vj2) (79) 
giving the overhead from the preconditioning to be: 

i? = 1.440-^ (80) 

Nt ^ ' 

In the unimproved Wilson Operator one has 

C(A^iLu) = (2664iVt - 48)(y,/2) (81) 

corresponding to a saving of about 22% over the previous case in eq. [76l 
The cost of the 4D Schur preconditioned operator becomes: 

C(M4DSchur) = 2044iVt(y,/2) (82) 

giving 

i?=1.30-^. (83) 

Nt ^ ' 



Thus, the overhead of temporal preconditioning is between roughly 41-43% for Clover, and 27-30% for the unim- 
proved Wilson Case. 



